[air] `pyarrow.fs` persistence (7/n): `ray.train.Checkpoint` restore: Auto-recovery fault tolerance #38141

justinvyu · 2023-08-04T23:45:54Z

Why are these changes needed?

This PR handles the auto-restoration fault tolerance direction for the new `Checkpoint` API:

- The latest `_TrainingResult(checkpoint, metrics)` data saved in the trial state on the driver gets sent to the workers for restoration.
- No checkpoint data gets downloaded during restoration.
- The user can access the checkpoint with `to_directory` and `as_directory`.
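
To illustrate the restoration flow described above, here is a minimal sketch that uses only the standard library. `ToyCheckpoint`, `save_checkpoint`, and `resume_step` are hypothetical stand-ins, not Ray's implementation: the checkpoint is modeled as a bare reference to a storage path, and data is copied only when the worker calls `to_directory` or `as_directory`.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in: a checkpoint is only a reference to a storage path."""

    path: str

    def to_directory(self, local_dir=None):
        """Copy ("download") the checkpoint into a local directory and return it."""
        local_dir = local_dir or tempfile.mkdtemp(prefix="toy_ckpt_")
        shutil.copytree(self.path, local_dir, dirs_exist_ok=True)
        return local_dir

    @contextmanager
    def as_directory(self):
        """Yield a temporary local copy and delete it when the context exits."""
        local_dir = self.to_directory()
        try:
            yield local_dir
        finally:
            shutil.rmtree(local_dir, ignore_errors=True)


def save_checkpoint(storage_root, step):
    """Write a tiny checkpoint to "storage" and return a reference to it."""
    ckpt_dir = os.path.join(storage_root, f"step_{step}")
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "step.txt"), "w") as f:
        f.write(str(step))
    return ToyCheckpoint(path=ckpt_dir)


def resume_step(checkpoint):
    """Worker-side restore: nothing is copied until the checkpoint is accessed."""
    if checkpoint is None:
        return 0  # fresh start, no checkpoint to restore from
    with checkpoint.as_directory() as local_dir:
        with open(os.path.join(local_dir, "step.txt")) as f:
            return int(f.read()) + 1


# The "driver" keeps only the latest checkpoint reference and hands it to the worker.
storage_root = tempfile.mkdtemp(prefix="toy_storage_")
latest = save_checkpoint(storage_root, step=4)
next_step = resume_step(latest)
```

In this sketch, sending `latest` to a restarted worker transfers only the path; the copy happens inside `resume_step`, when the user code opens the checkpoint.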

This PR also fixes a race condition in `as_directory`: the deletion lock should be set *before* the internal call to `to_directory`. Otherwise, worker 1 can exit the context and delete the directory while worker 2 is still waiting for the download to finish. Then, once worker 1 lets go of the download lock, the directory has already been deleted, so worker 2 errors.
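
The lock ordering can be sketched as follows. This is a simplified model under stated assumptions, not the PR's code: `SharedCheckpointDir` and `run_workers` are hypothetical names, and in-process `threading` locks plus a set of worker IDs stand in for whatever download and deletion locks the real implementation uses. The point it shows is that each worker registers its deletion lock first and only then waits for the download.

```python
import os
import shutil
import tempfile
import threading
from contextlib import contextmanager


class SharedCheckpointDir:
    """Hypothetical model of several workers sharing one local checkpoint copy."""

    def __init__(self, source_dir, local_dir):
        self.source_dir = source_dir
        self.local_dir = local_dir
        self._download_lock = threading.Lock()
        self._state_lock = threading.Lock()
        self._deletion_locks = set()  # workers currently relying on local_dir

    def _to_directory(self):
        # Only one worker copies; the others wait here and then reuse the copy.
        with self._download_lock:
            if not os.path.exists(self.local_dir):
                shutil.copytree(self.source_dir, self.local_dir)
        return self.local_dir

    @contextmanager
    def as_directory(self, worker_id):
        # Register the deletion lock BEFORE waiting on the download, so a worker
        # that finishes early cannot delete the directory out from under us.
        with self._state_lock:
            self._deletion_locks.add(worker_id)
        try:
            yield self._to_directory()
        finally:
            with self._state_lock:
                self._deletion_locks.discard(worker_id)
                if not self._deletion_locks:  # last user cleans up
                    shutil.rmtree(self.local_dir, ignore_errors=True)


def run_workers(num_workers=4):
    """Run several workers concurrently; return (errors, local copy still exists)."""
    storage = tempfile.mkdtemp(prefix="toy_storage_")
    with open(os.path.join(storage, "weights.txt"), "w") as f:
        f.write("ok")
    local_dir = os.path.join(tempfile.mkdtemp(prefix="toy_local_"), "ckpt")
    shared = SharedCheckpointDir(storage, local_dir)
    errors = []

    def worker(worker_id):
        try:
            with shared.as_directory(worker_id) as path:
                with open(os.path.join(path, "weights.txt")) as f:
                    assert f.read() == "ok"
        except Exception as exc:  # collect instead of crashing the thread
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors, os.path.exists(local_dir)
```

If the registration were moved after `_to_directory()`, a worker blocked on the download lock would be invisible to the cleanup check, which is the failure mode the description outlines.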

Other comments

Here were some other ideas for restoring the checkpoint index:

1. Store it inside the `_TrainingResult` when saving the checkpoint. Then, pass this index along with the checkpoint all the way to the worker. Use the index to initialize the starting checkpoint number.
2. Save it inside the Trial storage context. The trial storage context saved on the driver never sets the `checkpoint_index`, because that indexing is handled all the way on the trainable/worker. **This is what we're doing now. The driver's `trial.storage.current_checkpoint_index` gets incremented on every reported checkpoint, to stay in sync with the worker/trainable.**
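
A rough sketch of the second option follows, assuming a toy driver and worker. `ToyStorageContext`, `ToyWorker`, `ToyDriver`, and the `checkpoint_{index:06d}` directory naming are all illustrative assumptions; only the `current_checkpoint_index` attribute name comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class ToyStorageContext:
    """Hypothetical storage context that tracks the next checkpoint index."""

    current_checkpoint_index: int = 0

    def checkpoint_dir_name(self):
        # Assumed naming scheme, for illustration only.
        return f"checkpoint_{self.current_checkpoint_index:06d}"


class ToyWorker:
    def __init__(self, driver_storage):
        # The worker gets its own copy, seeded with the driver's current index.
        self.storage = ToyStorageContext(driver_storage.current_checkpoint_index)

    def report_checkpoint(self):
        name = self.storage.checkpoint_dir_name()
        self.storage.current_checkpoint_index += 1
        return name


class ToyDriver:
    def __init__(self):
        self.storage = ToyStorageContext()
        self.reported = []

    def start_worker(self):
        return ToyWorker(self.storage)

    def on_checkpoint_reported(self, name):
        # Increment on every reported checkpoint to stay in sync with the worker.
        self.reported.append(name)
        self.storage.current_checkpoint_index += 1


driver = ToyDriver()
worker = driver.start_worker()
for _ in range(3):
    driver.on_checkpoint_reported(worker.report_checkpoint())

# Simulate a worker failure: the replacement continues from the driver's index.
restarted = driver.start_worker()
next_name = restarted.report_checkpoint()
```

Because the driver's counter advances with every report, the restarted worker does not need to parse checkpoint directory names to find out where to continue.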

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
    corresponding `.rst` file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

justinvyu · 2023-08-07T21:51:35Z

@ericl This one is ready for a 2nd round. Then have 2 more lined up to finish restoration for trainers.

ericl · 2023-08-07T22:06:30Z

Just a small remaining comment.

ericl

Lgtm

justinvyu · 2023-08-08T18:11:29Z

@ericl Good to merge now

justinvyu added 30 commits July 27, 2023 14:24

Pipe storage context to Trainable (used now for Trainable syncing)

abb1307

Signed-off-by: Justin Yu <[email protected]>

Don't use the storage context in the trial/trainable

f6ff90a

Signed-off-by: Justin Yu <[email protected]>

Disable all trainable syncing in new codepath

562369f

Signed-off-by: Justin Yu <[email protected]>

Pipe storage context to Train workers (not actually used yet)

95a3d20

Signed-off-by: Justin Yu <[email protected]> Update to use the new checkpoint id attribute Signed-off-by: Justin Yu <[email protected]> Add todo comment to remove legacy path Signed-off-by: Justin Yu <[email protected]>

Fix race condition for setting checkpoint_uri

484e67f

Signed-off-by: Justin Yu <[email protected]>

Fix cyclical import

2148669

Signed-off-by: Justin Yu <[email protected]>

Add simple trainer test

8c856b8

Signed-off-by: Justin Yu <[email protected]>

Add legacy prefix to train session checkpoint uri

78c525f

Signed-off-by: Justin Yu <[email protected]>

Add new checkpoint class

e97f471

Signed-off-by: Justin Yu <[email protected]>

New train session report implementation using new checkpoint

64945be

Signed-off-by: Justin Yu <[email protected]>

Simplify checkpoint propagation from user code (in worker) -> trainer…

c6480c9

… -> driver Signed-off-by: Justin Yu <[email protected]>

New tune session.report

c681ccb

Signed-off-by: Justin Yu <[email protected]>

Save direction works with new checkpoint API

795bafe

Signed-off-by: Justin Yu <[email protected]> Fix lint Signed-off-by: Justin Yu <[email protected]>

Update test with e2e trainer test

8a084bc

Signed-off-by: Justin Yu <[email protected]> Fix lint Signed-off-by: Justin Yu <[email protected]>

Make callback supporting new checkpoint a todo for now

725d802

Signed-off-by: Justin Yu <[email protected]>

Remove unnecessary comment

877acb9

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

ee4ccbd

…persistence/storage_context_to_worker_temp

Separate out the new set checkpoint id from the old set checkpoint uri

88042b3

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

a5eeab2

…persistence/new_checkpoint Signed-off-by: Justin Yu <[email protected]>

Update id -> index

a6cd9dc

Signed-off-by: Justin Yu <[email protected]>

Address comments on error to raise with old ckpt type

01f34bb

Signed-off-by: Justin Yu <[email protected]>

Move checkpoint upload logic to a helper fn of storage ctx

65e7a27

Signed-off-by: Justin Yu <[email protected]> Fix lint for session.py Signed-off-by: Justin Yu <[email protected]>

Drop a checkpoint marker after uploading

f2a4c36

Signed-off-by: Justin Yu <[email protected]> Fix lint for storage.py Signed-off-by: Justin Yu <[email protected]>

Add a simplified checkpoint manager

49ee126

Signed-off-by: Justin Yu <[email protected]>

Fixes to checkpoint manager

ffa0dd4

Signed-off-by: Justin Yu <[email protected]>

Add unit test for simplified checkpoint manager

15553f7

Signed-off-by: Justin Yu <[email protected]>

Full test coverage

00cc9d7

Signed-off-by: Justin Yu <[email protected]>

Add a simplified checkpoint manager

cb5990e

Signed-off-by: Justin Yu <[email protected]>

Fixes to checkpoint manager

2db9aae

Signed-off-by: Justin Yu <[email protected]>

Add unit test for simplified checkpoint manager

a2067b7

Signed-off-by: Justin Yu <[email protected]>

justinvyu added 6 commits August 7, 2023 13:44

Fix restore info log

9556371

Signed-off-by: Justin Yu <[email protected]>

Keep current checkpoint index synchronized on the driver

e897eaa

Signed-off-by: Justin Yu <[email protected]>

Remove checkpoint dirname parsing

3eef417

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

0e52384

…persistence/restore_new_checkpoint_autoft

Update todo comment

89631ab

Signed-off-by: Justin Yu <[email protected]>

Fix lint

d4e20f2

Signed-off-by: Justin Yu <[email protected]>

justinvyu removed the @author-action-required label Aug 7, 2023

justinvyu requested a review from ericl August 7, 2023 21:26

ericl added the @author-action-required label Aug 7, 2023

justinvyu added 3 commits August 7, 2023 16:36

Rename to starting_checkpoint

72aa1fb

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

fb056f8

…persistence/restore_new_checkpoint_autoft

Fix lint

ca0df9f

Signed-off-by: Justin Yu <[email protected]>

ericl approved these changes Aug 7, 2023

justinvyu added 4 commits August 8, 2023 00:22

fix typo

3636a21

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

8743a99

…persistence/restore_new_checkpoint_autoft

Fix repr

a0c5a26

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

0a40c47

…persistence/restore_new_checkpoint_autoft

justinvyu added the tests-ok label and removed the @author-action-required label Aug 8, 2023

ericl merged commit d13ba07 into ray-project:master Aug 8, 2023
2 checks passed

justinvyu deleted the air/persistence/restore_new_checkpoint_autoft branch August 8, 2023 18:50

ericl mentioned this pull request Aug 10, 2023

[tune/train] Implement new persistence strategy and roll out as default option #38294

Closed
